[https://nvbugs/6179661][fix] Harden disagg cache transceiver teardown by chienchunhung · Pull Request #15422 · NVIDIA/TensorRT-LLM

chienchunhung · 2026-06-16T17:13:49Z

Background

PR #15363 by @nv-xtf / Tingfeng Xian identified and prototyped fixes for several disaggregated serving failure modes while debugging generation-side KV transfer timeout deadlocks. This PR extracts only the shutdown/teardown lifetime-hardening pieces from that work into a standalone PR so they can be reviewed independently from bounded polling, timeout cancellation, and request-cancellation semantics.

Summary

This change hardens cache transceiver teardown by:

- destroying `CacheSender` and `CacheReceiver` while the connection manager and transfer plugin are still alive,
- treating a null request-info connection as an explicit shutdown/`nullopt` path instead of returning a default `RequestInfo`,
- avoiding `sendResponse()` after the response thread wakes for termination without a valid response iterator.
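
The first point can be illustrated with a minimal sketch. It is not the code from this PR: `Worker`, `PluginHandle`, `Transceiver`, and `teardownOrder()` are hypothetical stand-ins, and the event log exists only to make the order visible. The members are declared so that implicit destruction would release the plugin first, which is the hazard the explicit resets avoid.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are not TensorRT-LLM types.
using EventLog = std::vector<std::string>;

struct PluginHandle
{
    explicit PluginHandle(EventLog& log) : mLog(log) {}
    ~PluginHandle() { mLog.push_back("plugin closed"); }
    EventLog& mLog;
};

struct Worker
{
    Worker(std::string name, EventLog& log) : mName(std::move(name)), mLog(log) {}
    // A real worker would stop and join its threads here, which may still touch the transport.
    ~Worker() { mLog.push_back(mName + " stopped"); }
    std::string mName;
    EventLog& mLog;
};

class Transceiver
{
public:
    explicit Transceiver(EventLog& log)
        : mSender(std::make_unique<Worker>("sender", log))
        , mReceiver(std::make_unique<Worker>("receiver", log))
        , mPlugin(std::make_unique<PluginHandle>(log))
    {
    }

    ~Transceiver()
    {
        // Without these resets, members are destroyed in reverse declaration order,
        // so mPlugin would be released before the workers have stopped.
        mSender.reset();
        mReceiver.reset();
        mPlugin.reset();
    }

private:
    std::unique_ptr<Worker> mSender;
    std::unique_ptr<Worker> mReceiver;
    std::unique_ptr<PluginHandle> mPlugin;
};

// Builds and destroys a Transceiver, returning the recorded teardown order.
EventLog teardownOrder()
{
    EventLog log;
    {
        Transceiver transceiver(log);
    }
    return log;
}
```

Making the order explicit in the destructor keeps teardown correct even if the member declarations are later reordered.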

The change is intentionally ungated because this is shutdown/lifetime correctness, not a new cancellation feature.

Dependency graph

Arrows point from prerequisite to dependent. PR numbers in graph nodes are clickable.

This PR is based directly on main; it does not depend on #15181 or #15356. It is shown with a dashed edge into #15238 because it is preferred hardening before the gated cancellation PR, not because it is part of the bounded-polling stack.

```mermaid
graph TD
    PR15139["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15139'>#15139</a>: transfer state consensus (merged)"]
    PR15422["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15422'>#15422</a>: teardown hardening (this PR, open for review)"]
    PR15181["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15181'>#15181</a>: bounded C++ transfer status polling (inflight)"]
    PR15356["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15356'>#15356</a>: bounded V2 context transfer polling (inflight)"]
    PR15238["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15238'>#15238</a>: in-flight cancellation + buffer poison (draft)"]
    WORK_BLOCKALL["blockAll / wait-all cancellation (planned)"]
    WORK_BUFFER["multi-slot buffers + unpoison recovery (planned)"]

    PR15139 -->|satisfied| PR15238
    PR15181 -->|blocking| PR15356
    PR15181 -->|blocking| PR15238
    PR15356 -->|blocking| PR15238
    PR15422 -.->|preferred hardening| PR15238
    PR15238 -.->|planned| WORK_BLOCKALL
    PR15238 -.->|planned| WORK_BUFFER

    classDef merged fill:#dcfce7,stroke:#16a34a,color:#14532d;
    classDef inflight fill:#dbeafe,stroke:#2563eb,color:#1e3a8a;
    classDef draft fill:#ffedd5,stroke:#f97316,color:#7c2d12;
    classDef current fill:#ede9fe,stroke:#7c3aed,color:#3b0764,stroke-width:3px;
    classDef downstream fill:#f3f4f6,stroke:#6b7280,color:#374151,stroke-dasharray:5 5;
    linkStyle 0 stroke:#16a34a,stroke-width:2px;
    linkStyle 1,2,3 stroke:#ea580c,stroke-width:3px;
    linkStyle 4 stroke:#64748b,stroke-width:2px,stroke-dasharray:3 3;
    linkStyle 5,6 stroke:#6b7280,stroke-width:2px,stroke-dasharray:5 5;

    class PR15139 merged;
    class PR15181,PR15356 inflight;
    class PR15422 current;
    class PR15238 draft;
    class WORK_BLOCKALL,WORK_BUFFER downstream;
```

Scope and relationship to related PRs

- Related to #15363 ([https://nvbugs/6179661][fix] Fix disagg generation-side KV transfer timeout deadlocks and teardown crashes) for the teardown failure modes and motivation.
- Independent of #15356 ([TRTLLM-12721][fix] Bound disagg transfer polling and admission); this PR does not change V2 bounded polling behavior.
- Independent of #15238 ([TRTLLM-12721][fix] Add gated C++ disagg in-flight cancellation); cancellation semantics, transport release, deferred cleanup, and rank-consistent cancellation remain in the gated cancellation PR.
- Preferred to merge before #15238 if possible, so that #15238 can rebase on main and inherit the teardown hardening.

Validation

- `git diff --check`
- `git commit -s` pre-commit hooks passed, including clang-format, codespell, duplicate waive checks, and test-list validation. The first hook attempt used system Python 3.9 and failed on Python 3.10 union syntax in `scripts/check_test_list.py`; rerunning with bundled Python 3.12 on `PATH` passed.

Summary by CodeRabbit

Bug Fixes
- Improved cache management stability by ensuring proper cleanup of worker threads during shutdown, preventing potential resource issues and system instability.
- Enhanced data reception reliability by implementing graceful error handling for connection failures and edge cases, reducing runtime errors and improving overall system robustness.

chienchunhung · 2026-06-16T17:50:07Z

/bot run --disable-fail-fast

coderabbitai · 2026-06-16T17:55:32Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7187c1d2-1ec9-45f6-9337-b4f7e8a0f270

📥 Commits

Reviewing files that changed from the base of the PR and between 163be83 and e5b9cc4.

📒 Files selected for processing (2)

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp

📝 Walkthrough


The `CacheTransceiver` destructor now explicitly resets `mCacheSender` and `mCacheReceiver` before closing the plugin handle. In `dataTransceiver.cpp`, `CacheSender::Impl::recvRequestInfo()` now returns `std::optional<RequestInfo>` and yields `std::nullopt` when the connection is null. The `response()` loop exits early when the optional is empty or termination is signalled, and the public wrapper unwraps the optional via `TLLM_CHECK`.
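
The optional-return pattern described above might look roughly like the following sketch. The types and function names here (`RequestInfo` with a single field, `Connection`, `recvRequestInfoImpl`) are simplified assumptions, not the actual signatures in `dataTransceiver.cpp`, and a plain `std::runtime_error` stands in for the `TLLM_CHECK` used by the real public wrapper.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

// Hypothetical, simplified types; the real RequestInfo and connection classes differ.
struct RequestInfo
{
    int requestId = 0;
};

struct Connection
{
    int pendingRequestId;
};

// Internal path: a null connection is reported as "no request" rather than
// as a default-constructed RequestInfo that callers could mistake for real data.
std::optional<RequestInfo> recvRequestInfoImpl(Connection const* connection)
{
    if (connection == nullptr)
    {
        return std::nullopt; // shutdown path
    }
    return RequestInfo{connection->pendingRequestId};
}

// Public wrapper: callers that require a value get a hard failure instead of
// silently receiving a placeholder. The exception is a stand-in for TLLM_CHECK.
RequestInfo recvRequestInfo(Connection const* connection)
{
    auto info = recvRequestInfoImpl(connection);
    if (!info.has_value())
    {
        throw std::runtime_error("recvRequestInfo: no request info (null connection)");
    }
    return *info;
}
```

The internal caller can branch on `has_value()` and leave its loop quietly, while the public entry point keeps its non-optional return type.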

Changes

Cache transceiver shutdown and null-connection robustness

| Layer / File(s) | Summary |
|---|---|
| CacheTransceiver destructor: reset workers before closing plugin handle `cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` | Adds `mCacheSender.reset()` and `mCacheReceiver.reset()` at the start of the destructor so worker threads stop before the UCX/NIXL/MOONCAKE plugin handle is released. |
| recvRequestInfo: optional return, null-connection guard, and response-loop early exits `cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp` | `Impl::recvRequestInfo()` now returns `std::optional<RequestInfo>` and logs a warning and returns `std::nullopt` when the accepted connection is `nullptr`. The `response()` loop exits early when the optional is empty or termination is signalled. The public `CacheSender::recvRequestInfo()` unwraps the optional with `TLLM_CHECK`. Adds `<optional>` include. |
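
The response-loop early exit can be sketched as below. This is an illustrative reduction under stated assumptions, not the PR's `response()` implementation: `Responder`, its queue of integer ids, and the external `sent` counter are all hypothetical. The point is that the thread can wake either because work arrived or because shutdown was requested, and only the first case has something valid to send.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical response worker; names and structure are assumptions for illustration.
class Responder
{
public:
    explicit Responder(std::atomic<int>& sent)
        : mSent(sent)
        , mThread([this] { loop(); })
    {
    }

    ~Responder()
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mTerminate = true;
        }
        mCv.notify_all();
        mThread.join();
    }

    void enqueue(int id)
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mQueue.push_back(id);
        }
        mCv.notify_all();
    }

private:
    void loop()
    {
        for (;;)
        {
            int id = 0;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mCv.wait(lock, [this] { return mTerminate || !mQueue.empty(); });
                // Woken for shutdown, or nothing valid to respond to: leave without sending.
                if (mTerminate || mQueue.empty())
                {
                    return;
                }
                id = mQueue.front();
                mQueue.pop_front();
            }
            sendResponse(id);
        }
    }

    void sendResponse(int /*id*/) { mSent.fetch_add(1); }

    std::atomic<int>& mSent;
    std::mutex mMutex;
    std::condition_variable mCv;
    std::deque<int> mQueue;
    bool mTerminate = false;
    std::thread mThread; // declared last so the other members exist before the thread starts
};
```

Destroying a `Responder` that never received work joins the thread without ever calling `sendResponse()`.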

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

Shixiaowei02
bo-nv
Tabrizian

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: hardening disaggregated cache transceiver teardown with proper reference to the NVBugs ticket and fix type. |
| Description check | ✅ Passed | The PR description is comprehensive with clear Background, Summary, and validation details. However, the Test Coverage and PR Checklist sections from the template are not addressed. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


tensorrt-cicd · 2026-06-16T17:57:00Z

PR_Github #54645 [ run ] triggered by Bot. Commit: e5b9cc4 Link to invocation

tensorrt-cicd · 2026-06-17T17:56:45Z

PR_Github #54645 [ run ] completed with state ABORTED. Commit: e5b9cc4

Link to invocation

chienchunhung · 2026-06-18T00:29:54Z

/bot run --disable-fail-fast

chienchunhung · 2026-06-18T00:34:12Z

/bot run

tensorrt-cicd · 2026-06-18T00:35:57Z

PR_Github #54882 [ run ] triggered by Bot. Commit: 81bcac9 Link to invocation

tensorrt-cicd · 2026-06-18T00:40:02Z

PR_Github #54884 [ run ] triggered by Bot. Commit: 81bcac9 Link to invocation

tensorrt-cicd · 2026-06-18T00:43:18Z

PR_Github #54882 [ run ] completed with state ABORTED. Commit: 81bcac9

Link to invocation

tensorrt-cicd · 2026-06-18T02:58:45Z

PR_Github #54884 [ run ] completed with state FAILURE. Commit: 81bcac9
/LLM/main/L0_MergeRequest_PR pipeline #43889 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-06-19T01:28:49Z

/bot run

tensorrt-cicd · 2026-06-19T01:34:45Z

PR_Github #54952 [ run ] triggered by Bot. Commit: a0f8bcc Link to invocation

tensorrt-cicd · 2026-06-19T08:42:52Z

PR_Github #54952 [ run ] completed with state SUCCESS. Commit: a0f8bcc
/LLM/main/L0_MergeRequest_PR pipeline #43951 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chienchunhung · 2026-06-21T02:52:29Z

/bot run --disable-fail-fast --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1"

tensorrt-cicd · 2026-06-21T02:58:44Z

PR_Github #54993 [ run ] triggered by Bot. Commit: a0f8bcc Link to invocation

tensorrt-cicd · 2026-06-21T03:40:25Z

PR_Github #54993 [ run ] completed with state SUCCESS. Commit: a0f8bcc
/LLM/main/L0_MergeRequest_PR pipeline #43985 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

chienchunhung · 2026-06-21T19:06:16Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-21T19:12:21Z

PR_Github #55004 [ run ] triggered by Bot. Commit: a0f8bcc Link to invocation

tensorrt-cicd · 2026-06-21T21:55:39Z

PR_Github #55004 [ run ] completed with state SUCCESS. Commit: a0f8bcc
/LLM/main/L0_MergeRequest_PR pipeline #43996 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

NVIDIA#15422) Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Signed-off-by: GitLab CI Bot <gitlab-ci@nvidia.com>

github-actions Bot assigned chienchunhung Jun 16, 2026

This was referenced Jun 16, 2026

[TRTLLM-12721][fix] Bound disagg transfer polling and admission #15356

Merged

[TRTLLM-12721][fix] Add gated C++ disagg in-flight cancellation #15238

Draft

chienchunhung requested review from Shixiaowei02, chuangz0, nv-xtf and pcastonguay June 16, 2026 17:49

chienchunhung marked this pull request as ready for review June 16, 2026 17:50

chienchunhung requested a review from a team as a code owner June 16, 2026 17:50

chienchunhung mentioned this pull request Jun 16, 2026

[https://nvbugs/6179661][fix] Fix disagg generation-side KV transfer timeout deadlocks and teardown crashes #15363

Closed

1 task

chienchunhung enabled auto-merge (squash) June 16, 2026 18:27

nv-xtf approved these changes Jun 17, 2026


chienchunhung force-pushed the codex/disagg-teardown-hardening branch from e5b9cc4 to 81bcac9 Compare June 18, 2026 00:33

[https://nvbugs/6179661][fix] Harden disagg cache transceiver teardown

a0f8bcc

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung force-pushed the codex/disagg-teardown-hardening branch from 81bcac9 to a0f8bcc Compare June 19, 2026 01:28

pcastonguay approved these changes Jun 22, 2026


chienchunhung merged commit 2e6abd1 into NVIDIA:main Jun 22, 2026
7 checks passed


4 participants
